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Objective: Anemia is a frequent comorbidity in hemodialysis patients that can be successfully treated by administering 
erythropoiesis-stimulating agents (ESAs). ESAs dosing is currently based on clinical protocols that often do not account for 
the high inter- and intra-individual variability in the patient’s response. As a result, the hemoglobin level of some patients oscillates 
around the target range, which is associated with multiple risks and side-effects. This work proposes a methodology based on 
reinforcement learning (RE) to optimize ESA therapy. 

Methods: RE is a data-driven approach for solving sequential decision-making problems that are formulated as Markov decision 
processes (MDPs). Computing optimal drug administration strategies for chronic diseases is a sequential decision-making problem 
in which the goal is to find the best sequence of drug doses. MDPs are particularly suitable for modeling these problems due to their 
ability to capture the uncertainty associated with the outcome of the treatment and the stochastic nature of the underlying process. 
The RE algorithm employed in the proposed methodology is fitted Q iteration, which stands out for its ability to make an efficient 
use of data. 

Results: The experiments reported here are based on a computational model that describes the effect of ESAs on the hemoglobin 
level. The performance of the proposed method is evaluated and compared with the well-known Q-learning algorithm and with a 
standard protocol. Simulation results show that the performance of Q-learning is substantially lower than EQI and the protocol. 
When comparing EQI and the protocol, EQI achieves an increment of 27.6% in the proportion of patients that are within the targeted 
range of hemoglobin during the period of treatment. In addition, the quantity of drug needed is reduced by 5.13%, which indicates 
a more efficient use of ESAs. 

Conclusion: Although prospective validation is required, promising results demonstrate the potential of RE to become an alternative 
to current protocols. 

Keywords: reinforcement learning, Markov decision processes, fitted Q iteration, chronic kidney disease, renal anemia, 
darbepoietin alfa 


1. Introduction 

Anemia is a common complication characterized by a re¬ 
duced concentration of hemoglobin (Hb) that occurs in over 
90% of patients undergoing hemodialysis IH]. Hemodialysis 
is the most common treatment for patients in advanced stages 
of chronic kidney disease (CKD), particularly in its end state, 
commonly referred as end-stage renal disease (ESRD). In the 
last years the prevalence of ESRD has increased substantially, 
reaching more than 1000 per million population in most of 
the developed countries ||2l. In some countries, such as USA 
and Japan, the current prevalence is over 2000 per million ||2l. 
ESRD involves a gradual loss of kidney function over time, 
which produces, among other health problems, a poor produc¬ 
tion of erythropoietin (EPO). This hormone regulates the red 
blood cell (RBC) production, a class of cells rich in Hb. Low 
Hb levels are associated with heart disease, poorer overall qual¬ 
ity of life, and increased mortality Hi. 

Current standard treatment of anemia consists mainly of 
the administration of erythropoiesis-stimulating agents (ESAs). 
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The response to this kind of drugs is known to have a large 
inter- and intra-interindividual variability due to differences in 
background characteristics, disease severity, comorbidities and 
concurrent medications fill]. Although there exist protocols 
to help physicians determine the appropriate dose, achieving 
stable Hb levels within the target range can be complex and 
often requires dose titration. Results from several studies sug¬ 
gest that a phenomenon known as Hb cycling is a common oc¬ 
currence in ESA-treated patients Qi. Hb cycling is defined 
as the cyclical, repeated, up and down movement of Hb lev¬ 
els during ESA treatment. The exact causes of Hb cycling are 
not yet completely understood; however, a number of possible 
reasons have been proposed. Eishbane and Berns suggested 
two ESA management practices as major causes. Eirst, the use 
of rigid dose adjustment protocols that do not account for the 
high heterogeneity in patient response. Second, narrow Hb tar¬ 
get ranges recommended in clinical guidelines lEIil, which 
need frequent dose changes. The effect of an ESA dose change 
does not reach a steady state until 70-120 days (RBC lifespan). 
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When doses are changed frequently, it is difficult to take into 
account the long -term effects of each dose, and often they are 
ignored 1112111311 . The link between Hb cycling and the devel¬ 
opment of several diseases ||3l together with the high cost of 
the treatment (e.g., around $2.3 billions per year in USA ||2l) 
justihes the need to improve current protocols. 

The widespread use of electronic medical records is giving 
rise to large amounts of data that could be useful to reduce 
medical errors, improve treatments and minimize side effects 
and costs llldll . This work proposes a methodology based on 
reinforcement learning (RL) to optimize ESA therapy. RL is 
a data-driven approach for solving sequential decision-making 
problems that are formulated as Markov decision processes 
(MDPs) lll5n . Computing optimal drug administration strate¬ 
gies for chronic diseases is a sequential decision-making prob¬ 
lem in which the goal is to hnd the best sequence of drug doses. 
MDPs are particularly suitable for modeling these problems due 
to their ability to capture the uncertainty associated with the 
outcome of the treatment and the stochastic nature of the under¬ 
lying process | U IS I ■ The standard approach to solve MDPs is 
dynamic programming (DP); however, the practical application 
of DP is limited because it cannot deal with large-scale prob¬ 
lems and requires full knowledge of the MDP model, including 
the transition probability function. In contrast, RL (also known 
as approximate dynamic programming (ADP)) uses function 
approximation to address large-scale problems and the data 
sampled from the process to implicitly represent the transition 
function lll9n . RL can exploit the information contained in med¬ 
ical records to compute policies of ESA administration tailored 
to the individual characteristics of each patient. In addition, the 
optimization process is made over sequences of doses instead 
of isolated doses, which is crucial to include the drug long-term 
effects. 


The methodology proposed in this work uses the algorithm 
fitted Q iteration to learn a policy of ESA administration from 
a set of medical records. The features employed to dehne the 
MDP model are extracted in part from the laboratory tests and 
in part from a clustering procedure of the patient’s main at¬ 
tributes. In order to test the methodology, a series of exper¬ 
iments has been conducted using a computational model that 
simulates the response of the patients. The performance has 
been assessed against the algorithm Q-learning and a standard 
protocol of dose adjustment. 

The rest of the paper is organized as follows. Next section 
provides a brief review of related work in this domain. Section|3] 
introduces the necessary background in RL and briefly ex¬ 
plains the algorithms employed in the experiments, namely, Q- 
learning and fitted Q iteration. The latter algorithm makes use 
of extremely randomized trees, a supervised learning method 
that is described in Section |4] Section |5] discusses the com¬ 
putational model used in the experiments to simulate patients’ 
response to ESA. Anemia management formulation using the 
MDP framework is presented in Section|6] Experiments carried 
out are detailed in Section]?] Section[8]shows and discusses the 
achieved results. Einally, conclusions and proposals for further 
work are given in Section^ 


2. Literature review 


The idea of using a data-driven method to optimize ESA ad¬ 
ministration is not new. Artificial neural networks have been 
used by several authors during the last decade to individualize 
ESA doses |2^22|. In general, those methods used current and 
previous Hb levels, ESA doses, and other variables that describe 
the patient’s condition, in order to predict the next Hb level. The 
goal of those previous works was to select the optimal ESA 
dose in order to achieve a given Hb level. This approach is suit¬ 
able only when the optimization horizon is the next time step. 
On the contrary, the aim of ESA therapy is the long-term Hb 
stabilization. The same idea has been applied using other ma¬ 
chine learning techniques, such as fuzzy logic |23[ 12411 . support 
vector machines ll25n or Bayesian networks ll26ll . 

Model predictive control (MPC) is a method of process con¬ 
trol whose main advantage is that it incorporates a finite time- 
horizon in the optimization process. Gaweda et al. ll^ showed 
that MPC may result in improved anemia management. A ma¬ 
jor difficulty of MPC is the requirement of an accurate system 
model. Even if the system model is available, RL has shown to 
be competitive with MPC 1281. 

RL in the context of anemia management was previously 
studied by Gaweda et al. lE^ and Martm-Guerrero et al. S. 
Both agree in the potential of RL to become an alternative to 
currently used protocols. The algorithm employed in those 
works was the popular Q-learning ii. This algorithm has 
been widely used in some fields as robotics because it requires 
little computation and can work in real time. However, Q- 
learning makes an inefficient use of the data, thus, it is not suit¬ 
able for problems in which acquiring data is costly 0211 . Pitted 
Q-iteration (PQI) 13311 is a relatively new RL algorithm that sig¬ 
nificantly reduces the quantity of data required to learn useful 
policies. Recently there has been a growing interest in apply¬ 
ing PQI to optimize the treatment of several diseases including 
HIV/AIDS O^. p sychiatric disorders ll^ . e pile psy 136. 371. 
schizophrenia OSl 13911 or smoking addiction 1401] . To the au¬ 
thors’ knowledge, this is the first work that applies PQI to the 
optimization of anemia treatment. 


3. Reinforcement learning 


Reinforcement learning (RL) is a general class of algorithms 
in the held of machine learning for solving decision-making 
problems where decisions are made in stages l4in . Such prob¬ 
lems are present in a wide range of fields, including operations 
research |42, ^|, artificial intelligence automatic con¬ 

trol or medicine 1^ . The standard RL setting consists of 
an agent (or controller) in an environment (or system). Each de¬ 
cision (also called action) produces an immediate reward. The 
agent learns to perform actions in order to maximize the reward 
collected over time. The goal is defined by the user through 
the reward function. Contrary to other approaches, RL does not 
rely on a mathematical model of the system, but is based on 
experience (or data). The agent obtains experience interacting 
with the environment. Pig. [Tjrepresents the main RL elements 
and how they interact. At each stage or discrete time-point k. 
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Figure 1; Elements of RL and their flow of interaction. 


the agent receives the environment’s state, and on that basis se¬ 
lects an action. As a consequence of its action, in the next time 
step, the agent receives a numerical reward and the environment 
evolves to a new state. The agent selects actions depending on 
the environment state using a policy that assigns an action to 
every state. Typically, the agent modifies the policy as a result 
of the interactions with the environment. 

The elements of the RL problem can be formalized using the 
Markov decision processes (MDPs) framework. Next, in Sec- 
tion l3.ll MDPs are used to formally introduce the basic compo¬ 
nents of RL. Then, the algorithms used in this work, Q-learning 
and fitted Q iteration, are briefly described in Section [L2l 

3.1. Markov decision processes 

Markov decision processes are a general mathematical 
framework for modeling sequential decision-making problems. 
An MDP is defined by the following elements; 

• A set of states, S: At each discrete time-point k, the envi¬ 
ronment occupies a state sjc e S. The state usually is com¬ 
posed by a vector whose components describe the current 
situation of the environment. As time passes, the vector 
values evolve in part as consequence of the actions ap¬ 
plied by the agent and in part stochastically. The state can 
simply be a variable observed directly from the environ¬ 
ment, or a more complex structure such as a set of vari¬ 
ables highly processed which combines information about 
the current and the past situations of the environment. 

• A set of actions. A; The agent applies an action ^ Am 
each state. The state in the next instant, is influenced 
by the current action. The actions are the mechanism em¬ 
ployed by the agent to control or guide the evolution of the 
environment. 

• A transition probability function, P : S xAxS —> [0,1): 
After action is taken in the current state, the transition 
function gives the probability of the next state, i.e., this 
function describes how the state evolves. 

• A reward function, p ; SxAxS —> K; Each transition be¬ 
tween states generates a reward = p{sk,ak, i/t+i) that 


*The only requirement is that the state should contain enough information 
to fulfill the Markov property (see Section lhi^ for more details). 


evaluates the immediate effect of the transition, but it does 
not provide information about its long-term effects. The 
reward function is defined by the user and implicitly codi¬ 
fies the goal of the agent. Notice that the reward function 
does not describe how to achieve the goal, but the agent 
must learn how to act from experience. 

Suppose a patient suffering from a certain disease that re¬ 
quires long term treatment with a particular drug. Usually, the 
aim is to administrate a suitable sequence of doses in order to 
control a variable (or several variables) related to the severity of 
the disease. Eor example, in anemic patients, Hb level is used to 
measure the degree of anemia. The MDP framework can be ap¬ 
plied to this problem modeling the patient as the environment. 
In this case, the state should contain all the information relevant 
to choose a proper treatment. In addition to the current level of 
Hb, the state may include other factors that can influence the ef¬ 
fect of the treatment such as the physical characteristics of the 
patients or their nutritional condition. The set of actions are the 
possible treatments (or drug doses) that can be administered to 
the patient. After each treatment, the patient status evolves to 
a new state. The new state will be in part a consequence of the 
treatment, and in part a consequence of other aspects that can¬ 
not be controlled by the agent, like for example the presence of 
inflammation or blood losses. If the objective of the treatment 
is to maintain the Hb level within a range, the reward function 
can be defined to provide a positive reward when the Hb level 
is between the limits of the target range and a negative reward 
in otherwise. 

The agent selects actions according to its policy n : S A. 
The policy is a function that maps states to actions, i.e., for each 
possible state of the environment, the policy indicates the action 
that should be performed: 


n{s) — a 


( 1 ) 


The objective of the agent is to learn a policy that maximizes 
the sum of rewards received over time, a quantity known as 
return. Such a maximizing policy, denoted by n*, is said to 
be optimal. The return usually is computed using the infinite- 
discounted horizon. In such a case, the return for an initial state 
So and under the policy n is ll47ll : 


R"{sq) = lim j y, /p(s*,7r(s*), st+i) > (2) 

U=0 j 

where y 6 [0,1) is the discount factor. This parameter can be 
intuitively interpreted as a way to balance the immediate reward 
and future rewards. Euture rewards are more relevant for the 
calculation of the return as y approaches 1. 

To find an optimal policy, the agent must explore the environ¬ 
ment: the probability of attempting new actions (different from 
those dictated by the policy) must always be non-zero. Other¬ 
wise, some areas of the state-action space may never be visited 
and the learning process can become stuck in a local optimum. 
The tradeoff between greedy action choices and ex plora tion is 
necessary for the performance of any RL algorithm 14811 . There 
exist several strategies to include exploration in the agent’s 
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behaviour, such as e-greedy exploration or Boltzmann explo¬ 
ration 1^]. 


3.2. Solving MDPs 

Solving an MDP means to find an optimal policy. There 
are several methods for solving MDPs, which can be grouped 
into two classes: dynamic programming (DP) and reinforce¬ 
ment learning (RL). DP methods require knowledge of the full 
MDP model. Since the transition probability function P rarely 
is available, this class of methods can only be employed in 
a limited number of practical problems. On the other hand, 
RL methods are completely based on experience, which makes 
them useful when the full MDP model is unknown or difficult 
to estimate. 

Most RL algorithms use Q-functions (also called utility func¬ 
tions) 2'' : 5 X A —> R to find an optimal policy. Given a policy, 
the Q-function for a particular pair {s, a) is defined as the ex¬ 
pected return that is encountered starting from s, taking action 
a and thereafter following policy n 1481] : 


Q^is, a) = £,- 1^,0 {p(i, a, s') + yR''{s')} 


(3) 


where s' is the state reached after taking action a in the state s. 
The Q-function measures the utility (in terms of the expected 
return) of perform each action in each state. 

The optimal Q-function is defined as the best Q-function that 
can be obtained by any policy: 


Q*(s, a) = max Q^is, a) (4) 

n 

From the optimal Q-function, an optimal policy can be easily 
derived choosing in each state the action that maximizes Q*: 


7T*(s) - arg max Q*is, a) (5) 

a 

where arg maX;^ fix) stands for the argument x that attains the 
maximum value of the function /(•). In general, for a given 
Q-function, a policy that maximizes Q in this way is said to 
be greedy in Q. Therefore, solving an MDP (i.e., finding an 
optimal policy) can be done by first finding Q*, and then using 
Eq. 0 to compute a greedy policy in Q*. 

When MDPs have a small enough number of states and ac¬ 
tions, Q-functions can be exactly stored in tables with one en¬ 
try per state-action pair. Unfortunately, many practical prob¬ 
lems contain a very large or infinite (for example when the 
state space is continuous) number of states; in such a case, Q- 
functions must be represented approximately by two reasons. 
First, suppose that the MDP contains N states, if N is very large, 
a table with N entries would be intractable due to computational 
and memory limitations. On the contrary, in a typical approx¬ 
imate representation is only necessary to store a vector of M 
parameters, being M ^ N. Second, when the state space is 
large or continuous, the agent will probably never be exactly on 
the same state more than once. Therefore, the experience ac¬ 
quired from a set of states should be generalized to other states 
that have not been seen before. In principle, any function ap¬ 
proximation (or regression) method can be used to represent Q- 
functions. However, in practice, some RL algorithms impose 
restrictions about the structure of the approximator. 


The remaining of this section describes two RL algorithms 
for solving MDPs. On the one hand, the well-known Q-learning 
algorithm 113111 . Q-learning is probably the most popular RL al¬ 
gorithm, it has been used in many applications, including the 
treatment of anemia in hemodialysis patients 1^ . On the 
other hand, fitted Q iteration ll^ . a more recent algorithm that 
forms part of the methodology proposed in this paper. Both are 
offline algorithms, which means that they do not require inter¬ 
acting with the environment during the learning phase, but in¬ 
stead can learn a solution using data collected in advance. The 
data, or experience, usually is stored as a set of transitions of the 
form (s, a, r, s') sampled from the process. This data set should 
be representative of the state-action space, i.e., they should con¬ 
tain a certain degree of exploration. Given that the agent cannot 
interact with the environment when RL algorithms are applied 
offline, in such a case it is not necessary to include a exploration 
strategy. 


3.2.1. Q-learning 

Consider a Q-function approximator denoted by F and pa¬ 
rameterized by a c/-dimensional vector 0. Every possible vector 
of parameters 0 provides an approximated representation of a 
corresponding Q-function: 


Qis,a) - Fgis,a) 


(6) 


where the symbol " denotes approximation. In general, the ap¬ 
proximator F can be nonlinear in the parameters. However, in 
some algorithms (e.g., Q-learning) linear approximators are of¬ 
ten preferred because they provide better convergence and sta¬ 
bility properties |4^5i|. A linearly parameterized approxima¬ 
tion of the Q-function is expressed as: 


Fgis, a) - ^ (pi(s, a)0i - (p^(s, a)0 


(7) 


1=1 


where ^is,a) - {(piis,a),... ,(pdis,a)'\ is the vector of basis 
functions (also called features 14811 1 that are combined using 
the c/-dimensional vector of parameters 0. A common approach 
to define the basis functions consists in using a regular grid of 
Gaussian radial basis functions (RBFs) spanned over the state- 
action space 04111521] . In such a case, for some state-action pair 
X - (s, a), the vector of basis functions is: 
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( 8 ) 

where each RBF is defined by its position or center c and vari¬ 
ance (T~. 

The Q-learning algorithm starts with an arbitrary approxi¬ 
mation of the optimal Q-function, i.e., an arbitrary vector of 
parameters 0o. Then, it uses the data from each transition 
[sk, Uk, rk+i, s'l^^Q to update the parameters using the following 
rule: 


0j+i ^0j-\-a 


rk+i -i-ymax<p'^(sk+i,a')0j 


-<f>'^isk,ak)0j\^isk,ak) (9) 
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where the index j — 1,.. .,p corresponds with the number 
of transitions in the data set and a is the learning rate. This 
learning rule updates the estimation of the optimal Q-function 
incrementally. Moreover, each update requires little computa¬ 
tion and memory resources, which makes possible to apply Q- 
learning in real-time. On the other hand, the algorithm presents 
two possible drawbacks: (i) it generally requires many tran¬ 
sitions to obtain useful policies, and (ii) the function approx¬ 
imator should be parametric and typically linearly parameter¬ 
ized I 


3.2.2. Fitted Q iteration 

Fitted Q iteration (FQI) is a batch RL algorithm whose main 
feature lies in the way that it handles the experience Un¬ 
like incremental algorithms, FQI uses the complete set of tran¬ 
sitions each time that updates the estimation of the optimal Q- 
function. Although this process involves more computation, it 
allows to extract more information from the stored experience. 
Consequently, FQI is more data-efficient than other RL algo¬ 
rithms. This feature makes FQI a very suitable algorithm in 
many application domains. For example, in the problem tack¬ 
led here, patients are modeled as the environment and the agent 
has to estimate optimal drug doses. Acquiring experience in 
this context entails administering a dose, waiting until it takes 
effect and measuring the variables that define the new patient’s 
condition. This process is expensive, both in time and money. 
Thus, reducing the quantity of data required by the algorithm 
can be crucial. 

Given a fixed set £) = {(i^,j = l,...,p)of 
p transitions and an arbitrary initial Q-function Qo (e.g., equal 
to zero everywhere on 5 x A), FQI starts by initializing an ap¬ 
proximation Qo of the Q-function Qo, with Qo(s, a) = 0 for all 
(s, a) e S xA. It then iterates over the following three steps OSll : 


1. n <— n H- 1 

2. Build the training set = {{inpuF, target^), j - 

1,... ,p) based on the function Qn-\ and on the full set 
of transitions D: 


inpuF — 

target = W +jmaxQ„^i(s-! a') 


( 10 ) 

( 11 ) 


3. Use supervised learning to induce from TS the function 
&i(s, a) 

Qnis,a) = rk+i +ym&xQn-i(sk+\,a') (12) 

a' 

The iterative process can be stopped simply by establishing a 
maximum number of iterations. Another possibility is to fix 
a threshold value ^ > 0 and stop the loop when the distance 
between two consecutive estimations of the optimal Q-function 
drops below the threshold, i.e., \Qn - Qn-\\ < ^ 1331]. 

FQI, similarly to other RL algorithms, requires a function ap¬ 
proximator to represent large or continuous Q-functions. How¬ 
ever, on the contrary to Q-learning, it does not impose any con¬ 
straint on the kind of approximator. In fact, at each iteration n 
it is possible to change the approximator in order to adapt the 


resolution (or complexity) of the model so as to reach the best 
bias/variance tradeoff l33n . Although FQI has been success¬ 
fully combined with many approximation methods (e.g., neu¬ 
ral networks |53] or linear regression a technique known 
as extremely randomized trees 15411 h as shown better perfor¬ 
mance than other approaches l33[ 13611 . Therefore, this tree- 
based method, whose details are introduced in the next section, 
was used in the experiments. 

Despite FQI is more data-efficient than other RL algorithms, 
the number of transitions required to learn optimal policies 
grows quickly with the state space dimensionality due to the 
“curse of dimensionality” 1^ . This feature is common to all 
RL algorithms, including also Q-learning. Thus, reducing the 
number of state variables as much as possible is an important 
issue. To this end, similarities among patients were exploited 
by k-means clustering analysis before applying any learning al¬ 
gorithm (see Section|3 15611 . 

4. Extremely randomized trees 


Extremely randomized trees Il54l] are a tree-based ensemble 
method for supervised classification and regression problems. 
It can be considered as an improved version of the popular tree 
ensemble method Tree Bagging S. The algorithm developed 
to compute this kind of ensembles is called Extra-Trees and, 
like Tree Bagging, it works by building several (M) trees. A 
key difference between both approaches lies in the sample used 
to compute the trees. While Tree Bagging uses a bootstrap sam¬ 
ple, Extra-Trees uses the complete data set to built each tree. 

Similarly to standard regression trees, each tree is composed 
of decision nodes, where each node contains a split (or test) of 
an attribute. The value where such attribute is split is known 
as cut-point. In order to define a decision node, Extra-Trees 
generates K splits by choosing K attributes at random and for 
each attribute a cut-point at random. Then, it calculates a score 
(based on the explained variance) for each of the K candidate 
splits and selects the split that obtained the maximum score. 
This process is repeated until the number of elements in the 
node is less than the parameter ifs^l . The algorithm has 
three parameters that need to be specified: the number M of 
trees to build the ensemble, the number K of candidate tests at 
each node and the minimal leaf size 


5. Erythropoiesis model under darbepoetin alfa treatment 

The ability of RL to compute treatment policies was assessed 
through simulations. Experiments were based on a computa¬ 
tional model that describes the effect of darbepoetin alfa on 
the Hb level. This section presents the main characteristics of 
the model in order to provide insight into the tackled clinical 
problem. [Appendix A| gives a more detailed description of the 
model. 

Several theoretical pharmacodynamic models describing the 
hematological response to different kinds of ESAs have been 
developed during the last decades 115814631] . The model in¬ 
troduced in this section is focused in patients undergoing 
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hemodialysis who are treated with intravenous darbepoetin alfa, 
a second generation ESA drug. The hematopoietic cell popula¬ 
tions are natural examples of biological systems governed by 
lifespan-based processes of cell proliferation, differentiation, 
maturation, and senescence. The Hb level is proportional to 
the number of erythrocytes, which are produced primarily from 
stem cells in bone marrow. During the process of maturation, 
stem cells undergo a series of differentiations. When they reach 
the stage of reticulocyte (immature erythrocytes), they begin to 
circulate in the blood and, after 1-2 days, these ultimately be¬ 
come mature RBCs (erythrocytes). In patients with CKD, ery¬ 
throcyte lifespan is approximately 70-90 days lls^ . The process 
of erythrocytes production (erythropoiesis) is regulated by the 
hormone EPO, which is produced in the kidneys. Darbepoetin 
alfa is a synthetic form of EPO that stimulates erythropoiesis 
by the same mechanism. 

Hb concentration dynamics following the administration of 
darbepoetin can be described through a multi-compartment 
model 1^ . The different cell types involved in erythropoiesis 
are grouped into population classes (or compartments) accord¬ 
ing to their characteristic properties with respect to interaction 
with EPO. The number of cells in all compartments depends on 
the plasma concentration of EPO, which consists of the sum of 
the naturally produced EPO, Ep, and the exogenous adminis¬ 
tered EPO (darbepoetin alfa), E: 

E,o,^E + Ep. (13) 


The rate of endogenous EPO production in the kidney is as¬ 
sumed to be constant. 

In the case of an intravenous administration, the total amount 
of darbepoetin alfa is injected into a vein within a very short 
time interval so that, without loss of generality, it can be as¬ 
sumed that the amount of the exogenous hormone in plasma at 
fo, the time when the administration takes place, is exactly equal 
to the amount of exogenous hormone administered. The result 
is a sudden rise of the hormone plasma concentration followed 
by an exponential decay described by a first order ordinary dif¬ 
ferential equation with a constant elimination rate 

£'(0= ^l0g(2)£(f), E(fo)=^, (14) 


where E{f) stands for the concentration of exogenous hormone 
at time f, Dq the drug dose at fo, and Vd is the volume of distri¬ 
bution at steady state, which is set to Vd - 52.4 165 1. 

The total Hb level at time t is proportional to the concentra¬ 
tion of erythrocytes R{t): 


Hb{t)^MCliR{t), (15) 


where R(t) is the plasma concentration of RBCs and MCH de¬ 
notes the mean corpuscular Hb concentration llh^l . chosen as 


MCH = 


2.7 g ■ (10"cells) * for men. 
2.4 g ■ (10"cells) * for women. 


(16) 


Appendix A In addition to endogenous EPO (Ep) and the 


mean corpuscular hemoglobin concentration (MCH), the indi¬ 
vidual response of each patient to darbepoetin alfa is defined 
by two other parameters: Cp, a constant that determines the 
flow of cells in the first compartment of the model (denoted by 
P), and Cr, which plays a similar role in the last compartment 
(denoted by R) (see Appendix A for more details). These pa¬ 
rameters can be adjusted using clinical and laboratory data of 
patients. Matlab 7.14 (R2012a) was employed to adjust the 
parameters and perform simulations. The Matlab DDE solver 
dde23 was used to compute the solutions of the system of dif¬ 
ferential equations. 

This computational model, as any other model, it is a simpli¬ 
fication of the real problem. It makes two basic assumptions: 
first, each patient maintains a stable level of inflammation, and 
second, the availability of iron for erythropoiesis is constant. 
Generally, these assumptions are not met by all patients during 
all treatment. Nevertheless, the model is able to capture the het¬ 
erogeneity in the response to darbepoetin alfa and its long-term 
effects, two factors suggested as principal causes of Hb cycling. 
Therefore, the model is useful to evaluate the performance of 
RE in preventing Hb cycling. 


6. Problem formulation using the MDP framework 


6.1. Anemia management as a sequential decision-making 
problem 

The symptoms of anemia and the response to the treatment 
often vary depending on several factors, such as the physical 
characteristics of the patient (weight, age, sex, etc.) 16711 . de¬ 
gree of kidney disease ll68n or comorbidities 11691] . During the 
treatment period, the clinical evolution of each patient is typi¬ 
cally monitored by monthly reviews. Each review consists of a 
laboratory test, which measures several variables to assess the 
patient’s condition, and the drug prescription until the next re¬ 
view. Therefore, each patient generates a sequence of observa¬ 
tions and treatments of the form I 


Oj, fj, o^, f^, • • • ,oij.,a!j.,o!j.^^ (17) 

where stands for the condition of the j-th patient at month k, 
stands for the dosage t for the same patient and month, and T 
is the duration of the whole treatment. 

A sequential decision-making problem can be easily derived 
from this set of sequences, where actions (or decisions) cor¬ 
respond to the doses prescribed to each patient in the differ¬ 
ent months (or stages). Before applying any RE method, it is 
necessary to define a function s(oi,ai,-- - ,Ok) that assigns a 
state to the patient’s current history and a scalar reward func¬ 
tion p{sk,ak, Sk+i) that evaluates the transition from Sk to 
after taking the action aic. The result of applying both functions 
to (fTTl i is a sequence of states, actions and rewards ll^ : 

ip aj, Lj, ^^2, r^, • • • ,Sj,aj,rj (18) 


The complete system of delay differential equations (DDE) that These sequences can be rewritten as a set of transitions 
describes the cell dynamics in each compartment is included (s, a, r, s'). Then, EQI can be applied to compute a policy that 
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selects optimal drag doses (in view of the data available) de¬ 
pending on the patient’s condition. 

The rest of this section details the definition of states, actions 
and reward function in the context of anemia treatment. 


6.2. State and action spaces 


The MDP framework assumes that the state transition dy¬ 
namics and the reward distribution has the Markov property, 
that is, given the current state s^ and action a^, the next state 
Sk+i and reward are conditionally independent of all previ¬ 
ous states, actions and rewards 114111 . An obvious way of defin¬ 
ing a state representation that holds the Markov property is to 
include all past history in s^. In practice, this is not feasible 
due to two reasons: (i) as stated in Section [3.2.21 the amount 
of experience needed to learn useful policies grows exponen¬ 
tially with the number of state dimensions due to the curse of 
dimensionality, and (ii) the space required to store experience 
would grow indefinitely as k grows. Therefore, the state should 
contain enough features to provide a “good summary” of the 
past history and, at the same time, it should be as compact as 
possible to allow the learning of useful policies from a lim¬ 
ited amount of experience. Despite that RL theory assumes the 
Markov property, most practical problems that may not fully 
satisfy this condition can be solved successfully by applying 
RL methods 14111 . For other cases, the MDP framework can be 
extended to a more general framework using partially observ¬ 
able MDPs (POMDPs) ll^ . 

In the anemia management problem, the state space of the 
MDP was expressed as 


s — [Hbk, AHbk, DAk, DAk^ 2 , Ptitientgroup] 

where Hb^ measures the degree of anemia in the current month 
k. AHbic, defined as the difference between the current Hb level 
and the previous one, indicates the Hb trend. is the dose 
of darbepoetin alfa. Due to the long-term effects of darbepoetin 
alfa, Hb at month k is not only influenced by the last drug dose, 
but also by the doses during the two previous months, therefore 
DAk-\ and DAk-i were also included in the state definition. On 
the other hand, given that the treatment should be tailored to 
the individual characteristics of each patient, information about 
such characteristics must be included in the state definition. The 
computational model employs four variables that describe each 
patient, namely, MCH, Ep, Cp and Cr. These variables could 
be directly introduced as state variables, however, prior knowl¬ 
edge about the problem was used to reduce the state space di¬ 
mensionality. MCH is a dichotomous variable that depends on 
the patient’s gender. Therefore, it was decided to remove this 
variable and compute a policy for men and another for women. 
The remaining variables {Ep, Cp and Cr) were analyzed us¬ 
ing k-means clustering 15611 . The objective was to find groups 
of patients that respond to the treatment in a similar way and, 
then, introducing into the state definition only the information 
about the kind of response instead of the patient’s character¬ 
istics directly. Thus, the output of the clustering algorithm, 
Patientgroup, also formed part of the state space. The underlying 


assumption is that the same treatment will produce a compara¬ 
ble effect in patients with similar characteristics. 

Regarding the action space, a set of discretized actions was 
used 

A = {0,0.25,0.50,0.70,1) yug/kg per week. 

These values correspond with the doses that can be adminis¬ 
tered to the patients. It should be noted that these weight- 
normalized doses are different for each patient depending on 
the body weight. In practice, using a discrete set of doses can 
be useful due to the fact that darbepoetin alfa is supplied in pre¬ 
filled syringes and, if the syringe is partially injected, the rest 
will be wasted. 

6.3. Reward function 

The reward function defines the agent’s goal. The aim of 
ESA treatment is to maintain Hb levels in a healthy range. Ac¬ 
cording to the European Best Practice Guidelines for the Treat¬ 
ment of Anemia in Chronic Kidney Disease lull, Hb concen¬ 
tration should be >11 g/dl for all patients, and no higher than 
12 g/dl for patients with comorbidities such as cardiovascular 
disease, diabetes, etc. Thus, the Hb target range used in this 
work was [11,12] g/dl. On the other hand, abrupt changes of 
Hb should be avoided. Specifically, changes (increments and 
decrements) close to 2 g/dl per month increase the risk of car¬ 
diovascular episodes 111 in . 

In order to satisfy both goals, the reward was designed as a 
piecewise function that depends on the current Hb level Hb^. 
The idea is that when Hb^ is close to the target, then the policy 
should try to reach the target range in the next month. However, 
if the difference between Hb^ and the target is too large, the Hb 
change required to reach the target in the next month can be too 
abrupt and it should be avoided. 

The following bell-shaped reward was employed when 
10.5 < Hbk < 12.5: 


Pmisk, Qi) = 1 - tank 


Hbk^i-n.5 


g 


(19) 


w = tanh * 




0.01 


where g - 0.5 is a parameter that controls the slope of the func¬ 
tion. This function assigns a maximum reward when Hbk+\ is 
equal to 11.5 g/dl (the center of the target range), and it de¬ 
creases smoothly to zero as Hbk+i differs from 11.5 g/dl, in¬ 
dicating that Hb levels near the target are preferable. Eig. |2a] 
shows the shape of pm- 

If Hbk ^ 12.5, the goal was to achieve a decrement of 1 g/dl: 


P-hHh{sk,ak) - 1 - tanh 


AHbk^i + 1 


g 


( 20 ) 


where g and w are the same as in Eq. ([T9l l and AHbk+i is the 
difference of Hb between the current and the next month, i.e, 
Hbk+i — Hbk- Similarly, if Hbk ^ 10.5, the agent’s goal was to 
obtain an increment of 1 g/dl. The complete piecewise reward 
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function is given by 


( PHb 

if 10.5 < Hbk < 12.5 


p{Sk,ak) = \ P-AHh 

if Hbk > 12.5 

(21) 

{ P+AHh 

if Hbk < 10.5 



Fig. |2]depicts each subfunction separately and the complete re¬ 
ward function versus Hbt and Hbk+i- 

1 . Experiments 

The purpose of the experiments was to assess the proposed 
method by comparing the quality of the learned policy with the 
popular Q-learning algorithm and with a standard protocol of 
dose adjustment. As illustrated in Fig.|3] experiments were car¬ 
ried out in three stages. 

First, the computational model was used to generate the ex¬ 
perience (data) necessary to apply the RL algorithm. Specifi¬ 
cally, 5000 hemodialysis patients treated with darbepoetin alfa 
during 30 months were simulated. Second, the data were pro¬ 
cessed in order to obtain a set of states, actions and reward. 
Then, a drug administration policy was learned from data using 
FQI. For comparison reasons, this learning process was also 
carried out using Q-learning. Finally, in the third stage, a new 
cohort of 60 patients was simulated in accordance with three 
treatment strategies: the policy learned using FQI, the policy 
learned using Q-learning, and the treatment recommended in 
the protocol. 

7.1. Experience 

The computational model was used to generate a data set that 
mimics the medical data gathered during common clinical prac¬ 
tice. The model uses four parameters (MCH, Ep, Cr and Cp) to 
describe the patient’s individual response to anemia treatment. 
MCH was chosen as indicated in Eq. (fTbl l. while the remain¬ 
ing parameters were adjusted employing data extracted from 
the medical records of 128 patients undergoing hemodialysis 
and receiving darbepoetin alfa. Data were collected between 
January 2008 and December 2010 in three hemodialysis cen¬ 
ters located in Italy. Table [1] shows the baseline characteristics 
of patients used to adjust the model parameters, and Table |2] 
shows a summary of the adjusted parameters. 

As the proposed method was applied separately for men and 
women (see Section l6^ . the data set was split by gender. Due 
to the high level of similarity between the results obtained in 
both groups, the rest of the work is focused only on the men’s 
group. After data splitting, the adjusted parameters were avail¬ 
able for 69 male patients. In order to increase the population 
size, linear interpolation was applied between the parameters 
of each patient and a patient randomly selected among its 10 
nearest neighbors. This process allows to generate artificial pa¬ 
tients that are similar to the real ones and, at the same time, it 
introduces a certain degree of randomness in order to capture 
the variability of the characteristics among different patients. 
The process was repeated until the parameters corresponding 
to 4931 new patients were generated, making a total of 5000 


Table 1: Baseline characteristics of the population used to ad¬ 
just the model parameters. 


Chai'acteristic 

(n = 128) 

Age (y) 

58.53 ± 15.15 

Sex (% male) 

53.91 

Height (m) 

1.62 ±0.09 

Weight (kg) 

67.97 ± 12.61 

Treatment 

Darbepoetin alfa (///kg/week) 

0.29 ±0.17 

Iron IV (mg/week) 

34.74 ± 32.08 

Laboratory data 

Hemoglobin (g/dl) 

11.80± 1.16 

Ferritin (ng/ml) 

430.90 ± 178.96 

SeiTim albumin (g/dl) 

4.09 ± 0.29 

Phosphate (mg/dl) 

4.52 ± 1.03 

Leukocytes (cells/mm^) 

6584.1 ± 1260.9 

Data are means ± SD. 

Table 2: Summary information on 

the model parameters. 


Parameters 

Value (mean ± SD) 

Ep 

0.3588 ± 0.0753 

Cr 

0.1372 ±0.0520 

Cp 

0.2014 ± 0.0640 


patients. Then, patients’ process over time was simulated dur¬ 
ing 30 months of treatment q The dose of darbepoetin alfa 
administered during each month was randomly selected among 
the discrete set A, which introduced a high variability of doses. 
If all patients had followed an specific treatment protocol, the 
learned policy would be strongly biased by that protocol. In the 
actual case, physicians’ prescriptions are based on clinical pro¬ 
tocols and in their own experience, therefore it is common to 
find different doses for patients with similar conditions. Never¬ 
theless, in general, doses prescribed by physicians are closer to 
the optimal ones, which facilitates the learning process. 

Simulations generated a total of 5000 x 30 = 150000 ob¬ 
servations and treatments. Due to the randomness of the treat¬ 
ment, in some observations the Hb level was greater than 20 
g/dl. These data were removed because they represent unreal¬ 
istic situations. After this restriction, the data set consisted of 
138011 observations. 

7.2. Learning 

The data generated in the previous stage were used as input 
of the learning algorithms FQI and Q-learning. To this end, 
the sequences of observations and treatments were transformed 
into a set of transitions {s, a, r, s') following the methodology 
introduced in Section |6l Some of the state variables were ex¬ 
tracted using the clustering algorithm k-means. The rest of this 


^Given that the prevalence of CKD is between 0.1% and 0.2% of the total 
population, the size of the simulated population is typical from a region with 
about 5 million inhabitants. On the other hand, patients are often treated for 
extended periods of time due to the chronic nature of CKD, however, 30 months 
was enough the generate the amount of data required to learn useful policies 










Figure 2: Reward function. Subfigures (a) and (b) represent the piecewise reward as three separate subfunctions. Function (a) is 
applied when 10.5 < Hbk < 12.5 and it varies depending on Hbk+i. Functions (b) are applied when Hbk < 10.5 (continuous green 
line) or Hbk ^ 12.5 (dashed red line) and they vary depending on AHbk+i- Finally, (c) represents the complete reward function 
versus Hbk+i and Hbk- 


section discusses some practical issues of A:-means, FQI and Q- 
learning. 

7.2.1. k-means 

The ^-means algorithm employed in the clustering analysis 
starts with an initial estimation of q centroids and then uses an 
iterative method to optimize the cluster quality. Typically, the 
number of clusters q is selected empirically. In this work, the 
clustering analysis was repeated using values of q from 3 until 
10. For each value of q, the obtained clusters were analyzed us¬ 
ing silhouette plots 17111 ; finally, q - 5 was selected as the most 
suitable because it maximizes the within-cluster compactness 
and the separability among clusters. 


7.2.2. Fitted Q iteration 

FQI is based on an iterative procedure and the designer must 
decide when to stop the iterations. The algorithm starts with an 
arbitrary approximation of the optimal Q-function that is im¬ 
proved at each iteration. Thus, the number of iterations should 
be enough to allow convergence to the optimal Q-function. 
Convergence can be measured in terms of distance between 
consecutive approximations of the Q-function, defined as 


dist(2„,2„_i) = 


#(S‘xA) 


( 22 ) 



Figure 4: Fitted Q iteration convergence measured as the dis¬ 
tance between Qn and versus number of iterations. 


where 5' x A are the state-actions pairs contained in the set of 
transitions J^. 

Fig. |4] shows the convergence of FQI versus the number of 
iterations, where it can be observed that the approximated Q- 
function remains almost stable after 35 iterations. In view of 
this, the number of iterations was fixed to 40. On the other 
hand, the discount factor y was empirically set to 0.9, this value 
was enough to incorporate the long-term effects of each dose in 
the optimization process. 
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2. Learning 


3. Evaluation 


1. Experience 



Figure 3: Experiments were carried out in three stages. First, the computational model was used to simulate a cohort of patients 
that are treated during a period of time, gathering the records generated by each patient in a database. Secondly, the records were 
processed to produce a set of transitions of the form {s, a, r, s'), and then FQI algorithm was applied. For comparison reasons, the 
learning process was also repeated using Q-learning. Third, the policies learned from the data were evaluated and compared with a 
standard protocol using a new cohort of simulated patients. 


Extra-Tree algorithm parameters were chosen following the 
procedure described in 13311 . The number of trees in the ensem¬ 
ble was set to M - 50, the parameter K was set equal to the 
dimension of the input space, i.e, K - 6, and the minimal leaf 
size was selected among the values = [5,10,50,100] using 
cross-validation in each iteration of FQI. Experiments showed 
that these default values are a good choice 15411 . 


7.2.3. Q-leaniing 

Q-learning is also based on an iterative procedure that es¬ 
timates the optimal Q-function. However, as each transition 
{s, a, r, s') is only used one time, when the number of transitions 
is limited the algorithm simply iterates over the complete set of 
transitions. Thus, it is not necessary to implement a strategy to 
stop the iterative procedure. 

The Gaussian RBF network with fixed bases employed to 
approximate the Q-function requires the definition of the num¬ 
ber of Gaussian functions, their centers and standard deviations. 
This process typically requires trial and error experimentation 
with various configurations. The number of functions should be 
enough to provide a smooth interpolation over the entire state- 
action space. If the Gaussians are too peaked, it will be nec¬ 
essary to employ a very large number of functions to cover the 
entire space. On the other hand, if they are too flat, the RBF net¬ 
work will not be flexible enough to approximate abrupt changes 
of the Q-function. Thus, it is necessary to reach a compromise 
between locality and smoothness. 

After experimenting with several configurations, the selected 
RBF network architecture employed 4096 Gaussian functions 
whose centers were distributed in a regular grid over the state 
space. The inputs were normalized to the range [-1,1] and all 
the standard deviations were set to cr = 1.1. The discount fac¬ 
tor was fixed as in the FQI algorithm (y = 0.9). Additionally, 
Q-learning introduces a new parameter that should be tuned: 
the learning rate a. A large value of a usually results in faster 
convergence, but when it exceeds a certain critical value, the 
algorithm becomes unstable. Fig. |5] shows the convergence of 
Q-learning (measured using Eq. (i22l i) when a - 0.2. The figure 
suggests that Q-learning has not converged after 138011 itera¬ 
tions. 

It is worth noting that each iteration of Q-learning (Fig. |5]) 
employs only a transition of the data set; whereas in FQI 



Figure 5: Q-learning convergence measured as the distance be¬ 
tween Qn and Qn-\ versus number of iterations. 


(Fig.lU each iteration is based on the complete data set. Hence, 
despite the number of iterations shown in both figures, the con¬ 
vergence process of FQI is much larger than in the case of Q- 
learning. 


7.3. Evaluation 

The policy learned with the proposed methodology based in 
FQI was evaluated in a cohort of 60 new patients using the 
computational model. Similar to Section 17.11 the parameters 
corresponding to these patients were generated by linear inter¬ 
polation. The evolution of each patient was simulated during 
30 months of treatment using the drug doses indicated by the 
RF policy. For comparison purposes, the same cohort of pa¬ 
tients was simulated according to the policy learned with Q- 
learning and the treatment recommended by a standard proto¬ 
col. The protocol was extracted from the European Medicines 
Agency {I'M and it describes the dosage regimen for Aranesp™ 
(a commercial form of darbepoetin alfa) as follows: 


• The initial dose for patients on dialysis should be 0.45 
//g/kg body weight administered weekly. 

• If the increase in Hb is inadequate (less than 1 g/dl in four 
weeks) the dose should be increased by approximately 
25%. 

• The dose must not be increased more frequently than once 
every 4 weeks. 
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• If the Hb rises more than 2 g/dl in a period of 4 weeks, the 
dose should be reduced by 25%. 

• If the Hb level is greater than 12 g/dl, the dose should 
be reduced by 25%. After this reduction, if the Hb level 
continues rising, the treatment should be temporarily inter¬ 
rupted until the level starts to decline, moment in which the 
treatment should be resumed with a dose approximately 
25% lower than the previous dose. 


When the computational model that simulates the patients 
is initialized, the level of Hb is determined by the initial 
conditions. These conditions are equal for all the patients 
(see Appendix Ai, i.e., at month zero all the patients have the 
same Hb level. In order to have a cohort of patients with a 
heterogeneous initial state, the 60 patients were treated during 
four months with a random treatment before starting the evalu¬ 
ation process. Thus, the protocol recommendation for the ini¬ 
tial dose was never applied because the patients had already 
received some previous doses of ESA when beginning the eval¬ 
uation. 


8. Results and discussion 

This section assesses the proposed methodology by compar¬ 
ing the policy learned using FQI (ttfoi) with the policy learned 
using Q-learning {jiQ.ieaming) and extracted from the protocol 
(^protocol)- All results shown correspond with the cohort of 60 
simulated patients used for validation purposes and 30 months 
of treatment. 

As a first approach to compare the behavior of the three poli¬ 
cies, the Hb levels obtained with each policy were represented 
in a box plot. This kind of representation graphically depicts a 
set of data values through their quartiles, allowing to visually 
estimate the degree of dispersion and skewness. Fig. |6] shows 
the box plot corresponding to HQ^eaming, t^fqi and nprotocol- In 
the three cases the median falls within the desired range of Hb 
(indicated by two horizontal dashed lines at 11 and 12 g/dl) and 
the Hb levels are approximately symmetrically distributed. The 
main difference observed among the three policies is the large 
dispersion of 7iQ4earmng, which means that many of the patients 
treated with this policy have Hb levels away from the target 
range. In the other two policies there are also Hb levels sub¬ 
stantially different from the target range; however, these values 
can be considered outliers (shown as red crosses) from the com¬ 
plete set of observations and are likely due to the initial states 
of the patients. 

The low performance of Q-learning is probably a conse¬ 
quence of the limited amount of transitions available to estimate 
the policy. In principle, both RF algorithms (Q-learning and 
FQI) should be able to find similar policies in a given problem. 
In fact, other authors found that, given enough data, the policy 
learned by Q-learning was superior than the protocol i29ll3(lll . 
Nonetheless, a key advantage of FQI is that it makes a more ef¬ 
ficient use of the data, which can be crucial in problems where 
obtaining data is a non-trivial task. A second factor that may be 



Figure 6: Box plot representation of the Hb levels correspond¬ 
ing to 60 simulated patients during 30 months of treatment. Pa¬ 
tients were treated according to nQ.iearmng, ^fqi and nprowcoi- 
The two horizontal lines (dashed black) indicate the Hb target 
range. 


related with the superiority of FQI is its flexibility to approx¬ 
imate the Q-functions. In Q-learning it is necessary to define 
an a priori approximator structure that remains fixed over the 
learning process. On the contrary, in each iteration FQI selects 
the approximator structure that provides the best approximation 
of the current Q-function. 

Since FQI outperforms Q-learning considerably, the rest of 
this section is focused only on analyzing the differences be¬ 
tween the policy obtained using FQI and the protocol of dose 
adjustment. 

A limitation of the box plot representation is that the tempo¬ 
ral information is lost. In order to compare the performance of 
the two policies along time, a better approach consists in rep¬ 
resenting the Hb levels during the treatment. Fig. Q shows this 
information for ttfqj and nprowcoi- For the sake of simplicity. 
Fig. I?] represents the Hb of only a subset of 19 patients ran¬ 
domly selected. The figure also includes two horizontal lines 
(dashed black) at 11 and 12 g/dl that indicate the target range. 
During the first months of treatment, Hb was shifted towards 
the target range with either of the two policies. In this sense, 
npQi was more effective than nprowcoi because, in general, it 
required less time to reach the target. After seven months of 
treatment, several signs of Hb cycling appeared in some of the 
patients treated with nprowcoi- On the contrary, npgi was able to 
prevent or drastically reduce Hb cycling. As expected, the ex¬ 
treme values of Hb previously shown as outliers in the box plot 
are mainly concentrated in the first months of the treatment. 

Fig. 0 also shows the variation of Hb over time, but in this 
case in terms of the mean (solid lines) and standard deviation 
(shadow areas) for the complete group of patients. Again, it 
can be observed that npgi stabilizes the Hb levels within the 
target range whereas the protocol produces oscillations. The 
large standard deviation corresponding to nprowcoi indicates that 
there are patients with dangerous Hb levels in all months. Al¬ 
though the Hb cycling produced by nprowcoi diminishes slightly 
over time, after 30 months it is still noticeable. Hb variability in 
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Patients treated according to the policy learned using FQI 



(a) 

Patients treated according to the dose adjustment protocol 



(b) 

Figure 7; Hb level evolution during 30 months of treatment for 19 simulated patients randomly chosen from the test group. In (a), 
the patients were treated according to the policy learned with the proposed method, nfQj. While in (b) they were treated according 
to the policy extracted from the protocol, ji protocol- The two horizontal lines (dashed black) indicate the Hb target range. 


patients treated with nfQi was significantly lower (p < 0.0001, 
one-tailed paired F-test) than those treated with nprotocol- 

A second metric employed to compare both policies was the 
number of months or observations in which the patients pre¬ 
sented an adequate Hb level. Fig.|9]presents this information by 
means of a bar chart. Five ranges of Hb grouped into three cat¬ 
egories were defined: the category suitable includes the target 
range [11,12] g/dl; unsuitable consists of the two ranges con¬ 
tiguous to the target range, that is, (10,11) and (12,13) g/dl; and 
dangerous contains the rest of Hb levels. Fig.|9]shows two bars 
corresponding with jifQ, (light gray) and nprotocol (dark gray) 
in each one of the five ranges. Ideally, all patient observations 
should be within the category suitable. When the patients were 
treated with their Hb level was within the target range 

in 82.1% of the observations, which represents an important 
improvement compared with the 54.5 % achieved by 7Tpr„tocoi- 
The rest of patient observations were mainly located in the cat¬ 
egory unsuitable, specifically, 11.02% for nfQj and 34.1% for 
^protocol- This percentage indicates that npr„tocoi also achieved 
a reasonable outcome. Finally, a minor percentage of patient 
observations was in the category dangerous', however. Figs. Q 
and [8] shows that these observations corresponds mainly with 


the initial months of treatment, when the drug administered has 
not yet produced any effect. 

In addition to stabilizing the Hb level inside the target range, 
the treatment must avoid abrupt Hb changes. The difference 
between jifQi and t: protocol concerning this aspect was negligi¬ 
ble. Both policies were really effective in avoiding Hb changes 
greater than 2 g/dl per month, more than 99.5% of state transi¬ 
tions met this condition. 

Due to the high costs of darbepoetin alfa, another point to 
take into account when comparing both policies is the quan¬ 
tity of drug used. There was a highly significant difference 
(p < 0.0001, one-tailed paired t-test) of 5.13% between the 
mean dose recommended by nprotocol (0.39 //g/kg/week) and 
TtFQi (0.37 z^g/kg/week). Thus, the treatment recommended by 
the RL policy not only produced better outcomes but also gen¬ 
erated lower costs. 

Finally, a particular case was studied in detail to obtain more 
insight into the behavior of each policy. Table [3] shows how the 
Hb level of a patient and the dose of drug administered varies 
along 15 months of treatment. The goal of this table was to 
compare the treatment recommended by both policies in a case 
where nprotocol caused Hb cycling. The months where Hb level 
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Figure 8: Mean Hb level (solid lines) and standard deviation 
(shaded areas) over time for 60 patients simulated according 
tifqi (in green) and nprotocol (in red). 



Hemoglobin ranges (g/dl) 

Figure 9: Comparison of the observed monthly Hb for the pa¬ 
tients simulated following ttfqi and nprotocol during 30 months. 
Ideally, all Hb observations should be in the range [11,12] g/dl. 


was within the target range are highlighted in bold. It can be 
observed that, in the fifth month, nprotocol started to increase the 
darbepoetin dose because Hb <11 g/dl, and it continued in¬ 
creasing the dose in months 6 and 7 because the Hb level had 
not yet reached the target range. Then, after reaching the tar¬ 
get (month 8), the Hb level continued rising despite drug dose 
was decreased (months 9 and 10). This phenomenon is due 
to the delay between drug administration and its effects in Hb. 
On the other hand, the doses administered by ttfqi suggest that 
the long-term effects are taken into account. For example, in 
months 7 and 8 the dose is maintained constant despite the low 
Hb level. In consequence, the Hb level in the next months in¬ 
creases and is stabilized within the target. 

The experiments reported in this paper were based on a 
computational model that simulates the patients’ response to 
the treatment with darbepoetin alfa. Although the proposed 


Table 3: Comparison of the treatment and clinical evolution of 
a patient treated according to jifqi and ji protocol- The Hb levels 
within the target range are highlighted in bold. 


Month 


FQI 

Dose adjustment protocol 

Hb 

(g/dl) 

Drug dose Hb 
(/ig/kg/week) (g/dl) 

Drug dose 
(/Jg/kg/week) 

1 

12.65 

0.50 

12.65 

0.75 

2 

12.78 

0.75 

12.91 

0.56 

3 

11.77 

0.75 

12.43 

0.42 

4 

11.28 

0.50 

11.36 

0.42 

5 

11.80 

0.50 

10.44 

0.52 

6 

11.43 

0.50 

10.13 

0.65 

7 

10.89 

0.50 

10.57 

0.82 

8 

10.91 

0.50 

11.52 

0.82 

9 

11.06 

0.50 

12.72 

0.61 

10 

11.22 

0.50 

13.50 

0.46 

11 

11.39 

0.50 

13.14 

0.34 

12 

11.56 

0.50 

12.11 

0.34 

13 

11.73 

0.25 

11.24 

0.34 

14 

11.74 

0.50 

10.90 

0.43 

15 

11.19 

0.25 

11.00 

0.54 


methodology is valid for the case of actual patients, certain as¬ 
pects should be modified. The main differences between simu¬ 
lated and real patients are discussed below. 

The inclusion of the model parameters (MCH, Ep, Cp, Cr) 
in the definition of the state space is likely the most important 
difference between applying the proposed system to simulated 
and real patients. These parameters, which are related to the 
patients’ individual characteristics, are used to find groups of 
patients that respond to the treatment in a similar way. In prac¬ 
tice, the parameters cannot be directly measured. One option 
would be to adjust them as in the simulated case; however, a 
more effective solution is to employ the variables commonly 
measured in the monthly reviews which provide the same in¬ 
formation. For example, the level of C-reactive protein, serum 
albumin and leukocytes can be used as an indicator of inflam¬ 
mation level. Thus, the clustering analysis should be performed 
using those variables as inputs instead of the model parameters. 

Some elements of the MDP, such as the reward function and 
the discount factor, should be carefully chosen to provide the 
desired outcomes. When using a computational model, trial- 
and-error procedures can be used to adjust them. On the con¬ 
trary, this approach is not viable in real domains. Thus, despite 
the valuable experience gained with simulated patients, the ex¬ 
pertise and advice of physicians are still necessary to design the 
MPD in the real case. 

Finally, a third issue that should be noticed is the two as¬ 
sumptions made by the model: stable level of inflammation and 
constant iron availability. As previously mentioned, these as¬ 
sumptions may are not met in the real case. The only require¬ 
ment to overcome both assumptions is that the clinical data em¬ 
ployed to learn the policy must contain enough examples of 
patients in which assumptions are violated, i.e, with variable 
levels of inflammation and iron availability. 
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9. Conclusions and future work 

This work has proposed a methodology based on RL to op¬ 
timize anemia treatment in hemodialysis patients. After formu¬ 
lating the problem using the MDP framework, RL was able to 
learn automatically near-optimal treatments from clinical data. 
Contrary to other techniques to solve MPDs, RL does not re¬ 
quire a complete knowledge of the system dynamics, a feature 
that can be crucial in medical problems. More specifically, the 
methodology uses the algorithm FQI, which stands out for its 
ability to make an efficient use of data. FQI was combined 
with a function approximator based on regression trees in or¬ 
der to deal with the continuous state space and to generalize the 
learned policy to the cases not covered by the data set. The state 
variables of the MDP were extracted partly from the laboratory 
tests and partly from a clustering analysis of the patient’s main 
attributes. 

The proposed methodology was evaluated through a compu¬ 
tational model that describes the effect of darbepoetin alfa on 
the concentration of Hb. In addition to FQI, the experiments 
were also performed using the well-known Q-learning algo¬ 
rithm. A standard protocol of dose adjustment was used for 
baseline comparison. The quality of the policy obtained with 
Q-learning, nQ 4 earning was considerably inferior in comparison 
to the other two policies {jifqi and n protocol)- When comparing 
TTfQj and nprotocoh the policy obtained with FQI increased by 
27.6% the proportion of patients with an adequate level of Hb 
and, at the same time, it reduced the amount of drug used by 
5.13%. 

The simulation results suggest that the FQI policy can deal 
with the long-term effects of darbepoetin alfa and the high vari¬ 
ability in the patient’s response. These features, which are ab¬ 
sent in standard dosing protocols, had been suggested as a ma¬ 
jor cause of Hb cycling. As a result, the proposed methodol¬ 
ogy was more effective than the standard-use tested protocol in 
maintaining stable Hb levels and preventing Hb cycling. On 
the other hand, the drug is prescribed in a more efficient way 
since the treatment achieves better outcomes with less amount 
of darbepoetin alfa. 

The computational model used in the experiments has several 
limitations owing to the assumptions on which it is grounded 
and, therefore, does not represent all possible patients. Never¬ 
theless, it reproduces some important difficulties present in ac¬ 
tual cases that may cause Hb cycling. Thus, although prospec¬ 
tive validation is required, experiments have shown the poten¬ 
tial benefits of RL in anemia treatment. 

The positive results obtained in this work using simulated pa¬ 
tients have motivated further research in applying the proposed 
methodology to actual patients. Currently, a tool for clinical de¬ 
cision support based on FQI is being validated through a clini¬ 
cal evaluation in five hemodialysis centers from three European 
countries. 

Although this work has been focused in renal anemia, the 
methodology can be extended to other types of anemia. For 
example, oncology patients who also receive darbepoetin alfa 
treatment, or even to other complex problems of drug adminis¬ 
tration such as warfarin therapy to prevent venous thromboem¬ 


bolism. Another interesting line of future research is to include 
other optimization aspects in the reward function, in addition to 
those related to the patients’ health, like the cost of the treat¬ 
ment. 
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Appendix A. Erythropoiesis model 


Many mathematical models have been developed to simu¬ 
late the process of erythropoiesis in different physiological sce- 
3- The most common approach is the use of 


age-structured models |75 -S- The model presented in this 
section is focused in patients undergoing hemodialysis who are 
treated with intravenous darbepoetin alfa. In addition, it incor¬ 
porates the effects of iron availability and level of inflamma¬ 
tion, although both are considered constant. Similar to ifs^ . 
the stimulating action of the drug is described through a multi¬ 
compartment model. Three different cell classes or compart¬ 
ments are considered; 


• P is the bone marrow concentration of progenitor cells 
(BFU-E and CEU-E) and precursor cells (proerythroblast 
and basophilic erythroblast) that depend on EPO for their 
survival and differentiation to the next compartment. 

• M comprises the remaining erythroblasts (polychromatic 
and orthochromatophilic) and marrow reticulocytes which 
are iron-responsive. As it is assumed that all patients have 
sufficient iron available, this compartment only serves as 
delay on the differentiation until the next compartment. 

• P is the plasma concentration of red blood cells. 

The equations governing the evolution of the number of cells 
in each population are simply balance equations ifs^ . Each 
compartment of cells is fed by an entering flow of fresh cells 
and is emptied by the outgoing flow of differentiated or apop- 
totic cells. 

The fundamental assumption is that every cell in each popu¬ 
lation lives for the same period of time, which is constant and 
denoted by Tp for cells P and Tr for cells R if^ . This as¬ 
sumption determines the cell elimination rate since the number 
of cells that are lost at time t must be equal to the number of 
cells that are born at the same time delayed by the appropriate 
lifespan. The loss process is modeled by means of weighted av¬ 
erages of the previous day incoming rates in order to take into 
consideration that cells actually have different exposition times 
to the drug. Such exposition time varies according to their in¬ 
ternal maturity level at the time of the administration. 

The flow entering the P compartment depends on the progen¬ 
itors response to the stimulatory effect of erythropoietin. Ac¬ 
cording to ifs^ . this response can be described by the Hill func¬ 
tion H{Etot) Etot/(E 50 + Etot), where Etot, which is defined as 
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in Eqs. (fljl l and (fT4li . represents the total plasma concentration 
of EPO, and £50 = lOO/Vd- The sensitivity £50 is the half- 
maximal effective concentration, a parameter commonly used 
as a measure of drug’s potency. 

The outgoing flow of the P compartment is also the feeding 
flow to the R compartment delayed by Tm, as it is assumed that, 
up to a proportionality factor (describing a proliferation activity 
of M cells), as many M cells as P cells are produced. 

At the last stage of their lifespan, RBCs become senescent 
and are cleared from the blood. The rate at which RBCs are 
cleared is a weighted average of the previous incoming rates. 
Such a average is a convex combination with coefficients given 
by a Gaussian distribution with mean equal to the RBC mean 
lifespan and variance 30. The Gaussian distribution indicates 
that the drug effects are stronger on T^-day old cells. 

The evolution laws of the compartment P and R are the fol¬ 
lowing; 


P'{t) = C„—— P{t) 

'^£5o + £tot(0 

T 


R'(t) = G 


Etot(f -Tj) 

Tp £50 + Eioiit - Tj) 

Etot(t - (Tp + Tm)) 


C, 


E 50 + £tot(f - (Tp + Tm)) 

Tp+T^+Tfi 


Pit-Tj) 

P(t - (Tp + Tm)) 


Etoiit - Tj) 

Zj - T j) 


S P.M.R r,=T7T'T„+1 ' -^50 + EUt - Tj) 


where gpj G(Tj - (Tp + Tm))), G{T) is the Gaussian distri¬ 
bution with mean T« and variance 30 evaluated at time T, and 
S p.m,R is ^ normalization factor given by 


Tp+Tw+Tfl Tfl 

Sp.i,R ^ G[Tj - (Tp + Tm)) = J] G{Tj). 

7-;=Tp+Tm+1 Tj=l 


In the present study, (Tp, Tm,Tp) = (9,4,70). At the time 
fo of the first administration, £0 = 1 and Rq = 1 (i e-j there 
are 10^' cells maturing from class P to R). The remaining pa¬ 
rameters, Ep, Cp and Cr, are estimated for each patient. The 
solution of the model, given in Eq. O, is fitted in the least 
square sense using the first six Hb levels of each patient mea¬ 
sured through laboratory tests. Then, the model can be used to 
simulate the response of that patient to the treatment. 

The basic assumptions of the model are that each patient 
maintains a stable level of inflammation and that iron availabil¬ 
ity for erythropoiesis is constant. However, dialysis patients of¬ 
ten have multiple comorbidities (such as hypertension, diabetes 
or cardiovascular disease) that may produce fluctuating levels 
of inflammation and reduce the availability of iron, which lim¬ 
its the usefulness of the model. 

On the other hand, when the assumptions are met, the model 
is able to reproduce the variability among patients in drug re¬ 
sponse and the long-term effect produced by the drug. Eig. IA.10l 
shows four cases in which the assumptions are reasonably well 
met. As it can be observed, in general, the model (continuous 
blue line) is able to simulate the Hb level measured by labora¬ 
tory tests (res asterisks). 
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Eigure A. 10: Hb level simulated using the model (blue line) 
and measured by laboratory tests (red asterisks) corresponding 
to four patients during approximately 26 months of treatment. 
In the four cases, model’s assumptions are approximately met. 
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